Abstract. Clustering is an unsupervised learning technique in which
data or objects are grouped into sets based on some similarity measure.
Most clustering algorithms assume that the main memory is infinite and
can accommodate the whole set of patterns. In reality, many applications
give rise to a set of patterns too large to fit in the main memory. When
the data set is too large, much of the data is stored in secondary
memory. Input/Output operations (I/Os) from the disk are the major
bottleneck in designing efficient clustering algorithms for large data
sets. Different design techniques have been used to design clustering
algorithms for large data sets. External memory algorithms are one class
of algorithms that can be used for large data sets. These algorithms
exploit the hierarchical memory structure of computers by incorporating
locality of reference directly in the algorithm. This paper contributes
to the design of clustering algorithms in the external memory model
(proposed by Aggarwal and Vitter in 1988) to make the algorithms
scalable. It is shown that the Shared Near Neighbors algorithm is not
very I/O efficient, since its I/O complexity is the same as its
computational complexity. The algorithm is redesigned in the external
memory model and its I/O complexity is reduced, while the computational
complexity remains the same. We substantiate the theoretical analysis by
comparing the performance of the proposed algorithm with that of its
traditional counterpart, both implemented using the STXXL library.
1 Introduction

Clustering is an unsupervised learning technique in which data or objects are
grouped into sets based on some similarity measure. The data points within a
group are similar, and points in different groups are dissimilar. There are a
few typical requirements for a good clustering technique in data mining [16,9]:
versatility, the ability to discover clusters with different shapes, a minimum
number of input parameters, robustness with regard to noise, insensitivity to
the data input order, scalability to high dimensionality, and scalability to
large data sets.



2 Nearest Neighbor based Clustering Algorithms for Large Data Sets 

1.1 Clustering of Large Data Sets 


The performance of the algorithm should not degrade as the data size
increases. Most clustering algorithms are designed for small data sets and
fail to fulfill the last requirement, i.e., scalability to large data sets.
Many scientific, engineering and business applications frequently produce very
large data sets [1]. The definition of "large" varies with changes in
technology, mainly the memory and the computational speed of computers. A data
set that is large in today's computing environment may not remain large after
a few years. However, the data size is increasing much faster than the
technology to handle it. A few approaches have been proposed in the literature
to handle large data sets, e.g., decomposition and incremental approaches
[10, 12]. Parallel implementation is also used to handle large data sets.

A few algorithms in the literature use preprocessing steps such as
summarization, incremental processing, approximation, and distribution to
cluster large data sets efficiently. With the help of these preprocessing
steps, they store a summary of the data set and generate the clusters by
considering only the summary. Examples include Balanced Iterative Reducing
and Clustering Using Hierarchies (BIRCH) [17], CLARANS [15], Clustering Using
Representatives (CURE) [8], and scalable K-means++ [4]. The BIRCH and CLARANS
algorithms are suitable when the clusters are convex or spherical and of
uniform size. However, they compromise on quality when clusters have different
sizes or non-spherical shapes [17]. These algorithms also use random sampling
and randomized search, which degrade the quality of the clustering because
not all the data points are considered [18, 7].

In traditional algorithm design, it is assumed that the main memory is
infinite and allows uniform random access to all its locations. In reality,
present-day computers have multiple levels of memory, and accessing data from
each level has its own cost and performance characteristics. If the data is
too large to fit in the main memory, it has to be stored on the disk of the
machine. Disk access is millions of times slower than main memory access.
Most clustering algorithms assume that the main memory is large enough for
the data set; for large data sets this is not a realistic assumption. In that
case the usual computational cost may not be an appropriate performance
metric, and the number of input/outputs (I/Os) can be a more appropriate
measure. Different design techniques are used to design algorithms for large
data sets. External memory algorithms are one such class of algorithms; they
exploit the hierarchical memory structure of computers by incorporating
locality of reference directly in the algorithm [2]. The external memory
model was introduced by Aggarwal and Vitter in 1988. This Input/Output model
(I/O model) views the computer as consisting of a processor, an internal
memory of size M, and an external memory (disk). The external memory is
considered unlimited in size and is divided into blocks of B consecutive data
items. The transfer of one block of data between disk and RAM is called an I/O.
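
The cost measure of the I/O model can be illustrated with a minimal sketch.
The function names `scan_ios` and `random_access_ios` are hypothetical and
chosen only for this illustration; the sketch assumes that items are stored
contiguously and that, in the worst case, every random access touches a
different block.

```python
import math

def scan_ios(n_items: int, block_size: int) -> int:
    """I/Os needed to read n_items stored contiguously, B items per block."""
    return math.ceil(n_items / block_size)

def random_access_ios(n_accesses: int) -> int:
    """Worst case: every access lands in a different block, one I/O each."""
    return n_accesses

N, B = 1_000_000, 1024
print(scan_ios(N, B))        # 977
print(random_access_ios(N))  # 1000000
```

The gap between the two counts is the reason external memory algorithms try
to arrange their data accesses block by block.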
1.2 Contribution of the paper

Shared Near Neighbor (SNN) [11] is a technique in which the similarity of two
points is defined by the number of neighbors the two points share. The main
advantage of shared near neighbor based clustering algorithms is that the
number of clusters need not be specified; it is generated automatically by the
algorithm. Document clustering and time series clustering are a few examples
where the SNN clustering technique is used. In this paper, an SNN based
clustering algorithm is designed in the external memory model to make it
scalable. Both the computational and the I/O complexity of the SNN algorithm
are $O(N^2 k^2)$, which is why the SNN algorithm is not I/O efficient and hence
is unsuitable for large data sets. The traditional SNN algorithm is redesigned
in the external memory model to make it I/O efficient. We show that the I/O
complexity of the proposed algorithm is $O(N^2 k^2 / BM)$, which is a $BM$
factor improvement over the traditional SNN algorithm. The computational
complexity remains the same. Both the traditional and the proposed algorithms
are implemented, and the performance of the proposed algorithm is compared
with that of its traditional counterpart. The proposed algorithm outperforms
the in-core algorithm, as expected from the theoretical results.


1.3 Organization of the paper 

This paper is organized as follows. In Section 2, the proposed scalable shared
near neighbors based clustering algorithm and its I/O analysis are described.
Section 3 contains the experimental results and observations. The concluding
remarks and future work are given in Section 4.

2 Proposed Scalable Clustering Algorithm based on SNN 

Shared Near Neighbor (SNN) is a technique in which the similarity of two
points is defined by the number of neighbors the two points share [11]. It can
efficiently generate clusters of different sizes and shapes.

The inputs of the SNN algorithm are two parameters: k (the size of the nearest
neighbors list) and θ (the similarity threshold). The performance of the
algorithm depends upon these parameters. In [13] an analytical process was
proposed to find the most appropriate values of the input parameters.


2.1 Traditional Shared Near Neighbors Algorithm 

The SNN algorithm has two steps. In the first step, the k nearest neighbors of
every point are calculated. The distance between two points can be calculated
using any distance measure. The k nearest neighbors of a point are arranged in
ascending order of distance. Since each point is its own zeroth neighbor, the
first entry of each neighborhood row is the index of the point itself. In the
second step, the shared near neighbors of each data point are calculated.
Assume that i and j, with i < j, are any two points having at least θ (the
similarity threshold) matching neighbors and that both points belong to each
other's neighborhood list. Then the larger index, j, is replaced by the smaller
index, i; that is, since i and j are similar, j is labeled as i. The I/O
complexity of this algorithm can be analyzed as follows: the first step takes
$O(N^2 D)$ I/Os and the second step takes $O(N^2 k^2)$ I/Os. Hence the overall
I/O complexity of the traditional algorithm is $O(N^2 k^2)$, which is the same
as its computational complexity.
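
The two steps can be sketched in memory as follows. This is a minimal
illustration and not the implementation evaluated in this paper: the function
names `knn_lists` and `snn_labels` are hypothetical, Euclidean distance is
assumed, ties are broken by point index, and "at least θ matching neighbors"
is read as `shared >= theta`.

```python
from math import dist

def knn_lists(points, k):
    """Row i holds i itself (zeroth neighbor) followed by its k nearest neighbors."""
    rows = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by distance, breaking ties by index (an assumption of this sketch).
        others.sort(key=lambda j: (dist(p, points[j]), j))
        rows.append([i] + others[:k])
    return rows

def snn_labels(knn, theta):
    """Single pass over all pairs i < j, relabeling similar points to the smaller label."""
    n = len(knn)
    label = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            # Both points must appear in each other's neighborhood list.
            if i in knn[j] and j in knn[i]:
                # Count shared neighbors, excluding the zeroth entries.
                shared = len(set(knn[i][1:]) & set(knn[j][1:]))
                if shared >= theta:
                    low = min(label[i], label[j])
                    label[i] = label[j] = low
    return label

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels = snn_labels(knn_lists(points, k=2), theta=1)
print(labels)  # [0, 0, 0, 3, 3, 3]
```

Every pair of rows is compared and each comparison scans two lists of length
k, which is where the quadratic dependence on both N and k comes from.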


2.2 Proposed Scalable Shared Near Neighbors Algorithm 

The traditional algorithm is not very I/O efficient, hence it is not suitable
for large data sets. In this section, we redesign the traditional algorithm in
the external memory model to make it I/O efficient and hence scalable to large
data sets. The computational steps of the proposed algorithm are the same as
those of the traditional algorithm, but the data access pattern is modified to
make the algorithm scalable.


Algorithm 1 Proposed Algorithm for Generating the k-Nearest Neighbors Matrix

Input: Set of data points S ∈ R^(N×D), i.e., N points having dimension D, and
       t is the block size.
Output: knn[N][k] // k-Nearest Neighbors Matrix.

t = M / (2(D + k))
for i = 0 to N/t − 1 do
    for j = 0 to N/t − 1 do
        Read S_i and S_j and also read knn_i
        Do the following computations in main memory.
        for l = (i)t to (i + 1)t − 1 do
            for m = (j)t to (j + 1)t − 1 do
                sum = Distance(l, m)
                if dist[l][k] > sum then dist[l][k] = sum
                for d = k to 1 & dist[l][d] < dist[l][d − 1] do
                    Find the appropriate position in the dist[l][d]
                    matrix block and also swap the index
                    value of that point into the different
                    matrix block knn[l][d]
    Write the matrix block knn_i into disk.


Computation of k-Nearest Neighbor Matrix: The first step of the algorithm is
the I/O efficient generation of the k-nearest neighbor matrix. Assume that the
N × D data set is partitioned into N/t blocks, each of size t × D. Here t is a
parameter to be fixed depending on the available main memory. Read any two
blocks S_i and S_j into main memory and calculate the distance between each
pair of points in main memory. Store the distances in a temporary matrix called
"dist" of size t × k and the indices of the corresponding points in a knn
matrix block of size t × k. After the k nearest neighbors of the block S_i
have been computed, write the knn matrix block to external memory. Repeat the
process N/t times. This process generates the knn matrix; the procedure is
described in Algorithm 1. The main memory contains 2 blocks of size t × D and
2 blocks of size t × k. Hence M = 2tD + 2tk.
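
The block-wise access pattern can be sketched as below. This is an
illustration under stated assumptions rather than the STXXL implementation:
the "disk" is an ordinary Python list, `blocked_knn` and `block_reads` are
hypothetical names, Euclidean distance is assumed, and the counter simply
charges two data-block reads (S_i and S_j) per block pair.

```python
import bisect
from math import dist

def blocked_knn(points, k, t):
    """k-nearest-neighbor lists computed one block pair (S_i, S_j) at a time.

    Returns (knn, block_reads). Each row starts with the point itself.
    """
    n = len(points)
    starts = range(0, n, t)
    knn = [None] * n
    block_reads = 0
    for si in starts:
        rows_i = range(si, min(si + t, n))
        # best[l]: up to k (distance, index) pairs for point l, kept sorted.
        best = {l: [] for l in rows_i}
        for sj in starts:
            block_reads += 2  # read S_i and S_j for this block pair
            for l in rows_i:
                for m in range(sj, min(sj + t, n)):
                    if m == l:
                        continue
                    cand = (dist(points[l], points[m]), m)
                    row = best[l]
                    if len(row) < k:
                        bisect.insort(row, cand)
                    elif cand < row[-1]:
                        row.pop()  # drop the current k-th neighbor
                        bisect.insort(row, cand)
        # "Write" the finished knn block for S_i.
        for l in rows_i:
            knn[l] = [l] + [m for _, m in best[l]]
    return knn, block_reads

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
knn, reads = blocked_knn(points, k=2, t=2)
print(knn[0], reads)  # [0, 1, 2] 18
```

With N/t = 3 blocks the sketch visits 3 × 3 block pairs, so the number of
block reads grows with (N/t)^2 rather than with the number of point pairs.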

The clustering step: In the first step of the algorithm, the knn matrix of size
N × k is generated, which is the input to the next phase of the algorithm.
Assume that the matrix is divided into N/t blocks of size t × k each. Also
assume that the label table of size N is divided into N/t blocks of size t
each. Read any two blocks knn_i and knn_j, where i < j, and also read the two
corresponding blocks of the label table, label_i and label_j, into main
memory. Then find all pairs of points from the knn_i and knn_j blocks that
satisfy the SNN similarity criteria. In this way the labels of all the points
of the knn_j block are calculated. Repeat the process N/t times. This process
generates the cluster labels. The procedure is described in Algorithm 2.

Here the main memory contains 2 blocks of size t × k and 2 blocks of size t.
Hence M = 2tk + 2t. The transfer of blocks between main memory and disk is
shown in Figure 1.


[Figure: on disk, the label table (N × 1) and the k-NN matrix (N × k), each
divided into blocks; in main memory, the i-th and j-th blocks of the k-NN
matrix (t × k each) and of the label table (t × 1 each).]

Fig. 1: Snapshot of the transfer of blocks between disk and main memory
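
The clustering step can be sketched with the same block-pair pattern. As
before this is only an illustration: `blocked_snn_labels` is a hypothetical
name, the knn matrix and label table live in ordinary lists instead of on
disk, the similarity test is assumed to be `shared >= theta`, and one
"block pair" stands for reading knn_i, knn_j and the two label blocks.

```python
def blocked_snn_labels(knn, theta, t):
    """Label points by visiting knn block pairs (i, j) with i <= j.

    Returns (label, block_pairs), where block_pairs counts visited pairs.
    """
    n = len(knn)
    starts = list(range(0, n, t))
    label = list(range(n))
    block_pairs = 0
    for bi, si in enumerate(starts):
        for sj in starts[bi:]:
            block_pairs += 1  # knn_i, knn_j, label_i, label_j are in memory
            for r in range(si, min(si + t, n)):
                for l in range(sj, min(sj + t, n)):
                    if r >= l:
                        continue  # skip self-pairs and duplicates when i == j
                    # Both points must be in each other's neighbor list.
                    if r in knn[l] and l in knn[r]:
                        shared = len(set(knn[r][1:]) & set(knn[l][1:]))
                        if shared >= theta:
                            low = min(label[r], label[l])
                            label[r] = label[l] = low
    return label, block_pairs

# Neighbor lists of six points on a line; row r starts with r itself.
knn = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [3, 4, 5], [4, 3, 5], [5, 4, 3]]
labels, pairs = blocked_snn_labels(knn, theta=1, t=2)
print(labels, pairs)  # [0, 0, 0, 3, 3, 3] 6
```

Only the order in which point pairs are visited changes relative to an
in-memory double loop; the comparisons themselves are the same.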


2.3 I/O Analysis 

The traditional algorithm takes $O(N^2 k^2)$ I/Os. The I/O complexity of the
proposed algorithm is described here.
Algorithm 2 Proposed Shared Near Neighbors Algorithm

Input: θ          // similarity threshold
       knn[N][k]  // k-nearest neighbor matrix
       label[N]   // cluster label of each point, initialized as label[i] = i
Output: label[N]  // cluster labels

t = M/k
for i = 0 to N/t − 1 do
    for j = i to N/t − 1 do
        Read matrix blocks knn_i and knn_j
        Do the following computations in main memory
        for r = (i)t to (i + 1)t − 1 do
            for l = (j)t to (j + 1)t − 1 do
                for m = 0 to k do
                    if knn[r][0] == knn[l][m] then count = 1
                for m = 0 to k do
                    if knn[l][0] == knn[r][m] then count++
                if count == 2 then
                    for p = 1 to k do
                        for q = 1 to k do
                            if knn[r][p] == knn[l][q] then n++
                    if n > θ then
                        if i == j then
                            Read the block label_i of size 1 × t into main memory
                            if label[r] > label[l] then
                                label[r] = label[l]
                            else
                                label[l] = label[r]
                        else
                            Read the blocks label_i and label_j, each of size 1 × t, into main memory
                            if label[r] > label[l] then
                                label[r] = label[l]
                            else
                                label[l] = label[r]
                count = 0, n = 0


I/O complexity of Algorithm 1: The main memory contains 2 blocks of size
t × D and 2 blocks of size t × k. Hence $M = 2tD + 2tk$, i.e.,
$t = O(M/(D + k))$. The total number of I/Os required to generate the knn
matrix is
$O\left(\frac{ND}{B} \cdot \frac{N}{t}\right) = O\left(\frac{N^2 D}{B} \cdot \frac{D + k}{M}\right) = O\left(\frac{N^2 Dk + N^2 D^2}{BM}\right)$.


I/O complexity of Algorithm 2: Here $M = 2tk + 2t$, i.e., $t = O(M/k)$. The
total number of I/Os required by the algorithm is
$O\left(\frac{Nk + N}{B} \cdot \frac{N}{t}\right) = O\left(\frac{N^2 k + N^2}{tB}\right) = O\left(\frac{N^2 k^2}{BM}\right)$.

So the total number of I/Os incurred by the two phases of the algorithm is
$O\left(\frac{N^2 Dk + N^2 D^2 + N^2 k^2}{BM}\right)$. The dimension D is a
constant, so ignoring the constant term, the I/O complexity of the algorithm
is $O(N^2 k^2 / BM)$, which is a $BM$ factor improvement over the traditional
algorithm.
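
The two bounds can be evaluated numerically with a small sketch. The function
names `knn_phase_ios` and `snn_phase_ios` are hypothetical, M and B are
assumed to be measured in data items, and the constants hidden by the O
notation are taken literally from M = 2tD + 2tk and M = 2tk + 2t, so the
values are estimates of growth rather than measured I/O counts.

```python
def knn_phase_ios(N, D, k, M, B):
    """Estimate for Algorithm 1: (N*D/B) block transfers per pass, N/t passes."""
    t = M / (2 * (D + k))          # from M = 2tD + 2tk
    return (N * D / B) * (N / t)

def snn_phase_ios(N, k, M, B):
    """Estimate for Algorithm 2: ((N*k + N)/B) block transfers per pass, N/t passes."""
    t = M / (2 * (k + 1))          # from M = 2tk + 2t
    return ((N * k + N) / B) * (N / t)

N, D, k, B = 1000, 4, 3, 10
for M in (1000, 2000, 4000):
    total = knn_phase_ios(N, D, k, M, B) + snn_phase_ios(N, k, M, B)
    print(M, round(total))
```

Doubling M halves both estimates, which matches the 1/M dependence in the
bounds above.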


3 Experimental Results 

3.1 Performance of the Proposed Algorithm 

Several external memory software libraries have been developed, for example
STXXL [6], LEDA-SM [5], and TPIE [3]. STXXL is used in our implementation.
STXXL is an implementation or adaptation of the C++ STL (Standard Template
Library) for external memory computations [14]. Both the traditional and the
proposed algorithms are implemented in STXXL [6].

Since both algorithms follow exactly the same computational steps, the
computational complexity remains the same and both generate the same set of
clusters. Hence the quality analysis is omitted. Our main focus is on analyzing
and reducing the I/O complexity of the algorithm.

The data sets are generated randomly. The dimension of the data is 64 and
the size of the data set varies from 10000 to 200000. The main memory size is
restricted to 1 MB and the hard disk size is 150 GB. The algorithms are
implemented on an Ubuntu 12.04 system with a 2.0 GHz Intel Core 2 Duo CPU
and 2 GB of main memory.

For ease of implementation, the algorithmic block size (t) is set equal to the
disk block size (B). When the block size is set to 8 KB and the available main
memory is restricted to 1 MB, the total number of reads or writes exceeds
$500 \times 10^6$ for $10 \times 10^4$ data points in the case of the
traditional algorithm, whereas for the proposed algorithm the number of reads
or writes is less than $100 \times 10^6$ even for $20 \times 10^4$ data points.
Similar results were obtained for the total number of I/Os and the total data
read and written. The total number of I/Os for the traditional algorithm
exceeds $900 \times 10^6$ for $10 \times 10^4$ data points, while it is less
than $150 \times 10^6$ for $20 \times 10^4$ data points in the case of the
proposed algorithm. The in-core algorithm fails to give a result after 5 days
for $15 \times 10^4$ points. Figures 2a, 2b, 2c and 2d illustrate the number of
reads, writes, data read/written and I/Os, respectively.


3.2 Effect of Main Memory Size on the Performance of the 
Proposed Algorithm 

The proposed algorithm is run with different main memory sizes to study the
effect of main memory on the performance of the algorithm. The main memory
is restricted to 1 MB, 4 MB, 16 MB and 128 MB.

It is clear from the graphs that the number of I/Os decreases as the main
memory increases. On close observation, when the number of data points is
$20 \times 10^4$ (≈ 100 MB) and the main memory size is 128 MB, the line
denoting the total number of I/Os is very close to the x-axis. A similar effect
of main memory size can be seen in the other graphs as well. This substantiates
the theoretical I/O analysis, in which the number of I/Os depends on the
available main memory size. Figures 3a, 3b and 3c illustrate the number of
reads, writes and I/Os, respectively.
Fig. 2: Performance of the proposed algorithm: (a) number of reads, (b) number
of writes, (c) total data read/written in GB, (d) total number of I/Os


Fig. 3: Effect of main memory size: (a) number of reads, (b) number of writes,
(c) total number of I/Os


4 Conclusion

This paper contributes to the field of big data clustering by redesigning an
existing algorithm in the external memory model. The shared near neighbors
(SNN) algorithm has been designed in the external memory model. It is shown
that the I/O complexity of the proposed algorithm is $O(N^2 k^2 / BM)$, which
is a $BM$ factor improvement over the traditional SNN algorithm. Both
algorithms produce the same set of clusters. Both algorithms are implemented
in STXXL to compare the performance of the proposed algorithm with that of the
in-core algorithm. The design technique can be used to adapt various existing
algorithms for large data. Without theoretical analysis it is often difficult
to say which clustering algorithm will perform better for data sets of
different sizes. One direction for future work is therefore to analyze the I/O
complexity of the best known clustering algorithms in the literature and to
design them in the external memory model to make them suitable for massive
data sets.